[57] ABSTRACT

A system that intelligently abstracts and archives a
document for storage and interprets a free form user
retrieval query to recall the document from the storage
file. The system includes a method for automatically
selecting keywords from the document using a parts of
speech dictionary. A method is given for weighing the
importance or centrality of each keyword with respect
to the document of its origin. Using the same logic
paths, given a free form query that describes the document
in the same manner that it would have to be described to
a secretary to "find" it in a filing cabinet, the system
automatically determines the key matching terms and
finds the archived document(s) with the greatest affinity.

11 Claims, 3 Drawing Figures 
OFFICE CORRESPONDENCE STORAGE AND
RETRIEVAL SYSTEM

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to information storage and retrieval and more
particularly to methods of automatically abstracting, storing and
retrieving documents using free form inquiry.

2. Description of the Prior Art

In implementing a document storage and retrieval system, the
practicality and utility of such a facility is governed by the ease
with which respective documents are cataloged into the system and the
efficiency with which a user's request can be associated with the
related document catalog representation (description). State of the
art document storage and retrieval is based on manually selecting
keywords to represent a document in the system's catalog or index and
then effecting retrieval by recalling from memory appropriate keyword
terms and either automatically or manually searching the index for an
"appropriate" level of match against the prestored keywords.
Procedures have been developed in the prior art for abstracting
documents and retrieving them based on keyword matching. One of the
procedures requires the requestor to supply in a fixed format certain
details about the subject document such as: author, addressee, date
and keywords or phrases. For retrieval, a summary sorted listing is
prepared under each of the above headings. The requestor must discern
the appropriate document by examining the entries under the retrieval
information headings. No latitude is allowed in the search clues. The
search may be done by manual perusal or using data processing global
find commands.

A second procedure stores all non-trivial words (i.e., ignores
articles and pronouns, etc.) in a document as a totally inverted file.
The document/line/word position of origin is maintained in the
catalog. Search of the database for retrieval is effected by the user
supplying keywords based on the user's memory. The catalog is
automatically searched with the added facility that the user can
specify relations that must exist between the keywords as they exist
in the original text (i.e., keyword 1 is before keyword 2, etc.). An
example of such a system is the IBM Data Processing Division product
Storage and Information Retrieval System, commonly called STAIRS.

A third method for document storage and retrieval is simply storing
the document in machine readable form and searching all documents
using a "global find" logic for each user supplied keyword. In theory
and in practice for small data bases, the "global find" can be
replaced by the user reviewing the documents verbatim as they are
displayed on a CRT type device.

However, in all the above procedures for document storage and
retrieval, the major intelligent burden for abstraction and retrieval
association matching is put on the user. Where the system aids in
abstraction or matching, it is done at the cost of voluminous
cataloging procedures, a massive data processing burden, and a
structured format that the user must follow to communicate with the
system for retrieval.

SUMMARY OF THE INVENTION

It has been discovered that all non-trivial correspondence is made
topic specific by a relatively small number of message specialization
terms. These are the words that transform the "boiler plate" of
business correspondence into the message that the author wishes to
convey. These terms consist mainly of numerics, proper names,
acronyms, nouns and single purpose adjectives. Any meaningful
description of a document for query purposes must contain at least
some of these terms which give the document its particular meaning.
This invention includes a technique for reliably locating the message
specialization terms in a document and forming an abstract of the
document using these terms. The technique utilizes the data storage
technology disclosed in U.S. Pat. No. 3,995,254 issued Nov. 30, 1976
to W. S. Rosenbaum and incorporated herein by reference to store a
dictionary of words for spelling verification; however, other
dictionary storage methodologies could also be used. The
specialization terms in the dictionary memory additionally have a data
bit appended to them to indicate their status as a noun or single
purpose adjective. Numerics, proper names, and acronyms are not stored
in the dictionary memory. The text of the document is compared with
the contents of the dictionary memory, and those words that compare to
nouns and single purpose adjectives in the dictionary and those words
(proper names, numerics, acronyms) not found in the dictionary memory
are accumulated to form an abstract of the document. Each word in the
abstract is then stored in a word index file. Records in the word
index file include the word, the identification code of the
document(s) in which the word occurs, the number of times the word
occurs in each respective document, the number of pages in each
respective document, an indicator as to whether the word is a numeric,
proper name/acronym, noun/single purpose adjective, and an indicator
as to whether the word occurs in the header, trailer, body or copy
list of the document. (A single purpose adjective is a word whose
primary use is adjectival, for example heavy, round, old, new, the
colors red, blue, etc.) The words in an input query for retrieval of a
document are compared against the word index file. Since some words in
the word index file may occur in several documents, weighing factors
are accorded each word based on the information stored with the word
in the word index file. A score is accumulated for each document that
contains any of the words in the retrieval query and those documents
with highest scores are presented to the user for review.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of system components in the document
storage and retrieval system.

FIG. 2 is a flow chart of the operation in abstracting and storing a
document.

FIG. 3 is a flow chart of the operation of the system in retrieving a
document in response to a user query.

DESCRIPTION OF THE PREFERRED
EMBODIMENT

Referring to FIG. 1 there is shown a block diagram of a document
storage and retrieval system which includes a processor or CPU 10 of
the general purpose type which is capable of decoding and executing
instructions. The processor 10 is in two-way communication over bus 13
with a memory 14 containing instructions which control its operation
and define the present invention. The processor 10 is also in two-way
communication over bus 7 with memory 8 which contains a parts of
speech dictionary where all nouns and single purpose adjectives are so
noted. The memory 8 contains
no numerics, acronyms or proper names. The processor 10 is also in
two-way communication over bus 11 with main memory 12 which is used
for storing the documents and key word index files. The instruction
memory 14 and dictionary memory 8 may be of the read only storage or
random access storage type, while the main memory 12 is of the random
access storage type.

For document abstracting and archiving an input register 16 receives
the text words from a source (not shown) over bus 17. The source may
be any of various input devices including keyboard, magnetic tape
reader, magnetic cards/disk/diskette files, etc. Text words are
presented to processor 10 by register 16 over bus 15 for processing in
accordance with instructions stored in instruction memory 14. The
results of the processing (abstraction) performed on the text contents
of register 16 are transmitted to memory 12 over bus 11.

For document retrieval, input register 16 receives the query text
statement from a source (not shown) over bus 17. The source may be any
of various input devices such as a keyboard, script table, or
especially constituted touchtone pad. The query statement text is
presented to processor 10 by register 16 over bus 15 for processing in
accordance with instructions stored in instruction memory 14. The
processor 10 under control of instructions from instruction memory 14
communicates with the contents of dictionary memory 8 over bus 7 and
memory 12 over bus 11 to perform a document retrieval affinity
evaluation on the contents of memory 12. The selected document(s) are
transmitted from memory 12 over bus 11 and bus 9 to output register 18
and from output register 18 over bus 19 to a utilization device which
may take various forms, including a display, printer or voicecoder,
etc. The selected document(s) are then presented to the user for
review.

The preferred embodiment of the present invention comprises a set of
instructions or programs for controlling the document abstracting,
archiving and query statement affinity match for retrieval for the
document storage and retrieval system shown in FIG. 1. Referring to
FIG. 2 there is shown a flow chart of the programs for abstracting and
archiving documents.

It is standard practice in data processing systems having on-line
storage to assign each record stored a unique identifier code or
number. This code is usually eight characters in length and does not
contain information that is descriptive of the contents of the record
because of the limited length. The identifier code is useful for
accessing the records where the user is able to associate the
identifier code with a particular record. However, this technique for
locating a record becomes impractical where the data base is large and
several users have access to the same records. A record usually
retains the same identifier code throughout its existence and
modifications to the record replace the record in storage under the
same identifier code. The program for abstracting and archiving
documents makes use of the identifier code by including it as part of
the abstract record. When a document is entered into the system,
FIG. 2, the document identifier code or number for the document is
read at block 20 and the word index files already stored in the system
are compared to determine if a match is found indicating that an
abstract is currently stored for the document.

TABLE 1

Document Abstraction Routine

BEGINPROCEDURE(OCRS_ABSTRACT);
  ENTER ABSTRACT, SAVE DOCUMENT NUMBER PARAMETER;
  READ DOCUMENT ABSTRACT FILE RECORD FOR DOCUMENT NUMBER;
  IF
    RECORD FOUND
  THEN
    CALL (DELETE_ABSTRACT);
  ENDIF;
  WHILE
    NOT END OF DOCUMENT
  DO
    WHILE
      NOT END OF PAGE
    DO
      GET NEXT LINE OF TEXT FROM THE DOCUMENT;
      WHILE
        MORE CHARACTERS EXIST ON THE LINE
      DO
        GET NEXT WORD FROM THE LINE (2 OR MORE
          CONSECUTIVE CHARACTERS A-Z, 0-9, OR y.
        IF
          THE WORD IS "CC"
        THEN
          SET CC LINE NUMBER TO THE DOCUMENT
            LINE NUMBER MINUS 1;
        ENDIF;
        CALL (ABSTRACT_PROCESS_WORD);
      ENDWHILE;
      INCREMENT PAGE NUMBER BY 1;
    ENDWHILE;
    INCREMENT DOCUMENT LINE NUMBER BY 1;
  ENDWHILE;
  SET LAST BODY LINE COUNT TO THE LESSER OF:
    THE CC LINE NUMBER AND THE DOCUMENT LINE NUMBER;
  DECREMENT THE LAST BODY LINE COUNT BY 4;
  CALL (ABSTRACT_END_PROCESSING);
ENDPROCEDURE(OCRS_ABSTRACT);

Table 1 is the program routine in Program Design Language (PDL) for
abstracting the document. If the document number (identifier code) is
found to exist in the abstract file, the program routine branches to
the delete abstract routine of Table 2 which is shown as block 22 of
the flow chart of FIG. 2.
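
The page, line and word loops of Table 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the input format (a list of pages, each a list of lines), the function name `scan_document`, and the 1-based line numbering are hypothetical, and the word rule omits the one extra character that Table 1 allows because that character is not legible in the source.

```python
import re

# Assumption: a "word" is two or more consecutive letters or digits.
# Table 1 allows one further character that is not legible in the source,
# so it is left out here.
WORD_RE = re.compile(r"[A-Za-z0-9]{2,}")


def scan_document(pages):
    """Walk pages -> lines -> words in the manner of the Table 1 loops.

    `pages` is a list of pages, each a list of text lines (hypothetical
    input format). Returns (words, last_body_line), where `words` holds
    (WORD, document line number) pairs.
    """
    words = []
    line_number = 0   # assumed 1-based once the first line is read
    cc_line = None
    for page in pages:
        for line in page:
            line_number += 1
            for word in WORD_RE.findall(line):
                word = word.upper()
                if word == "CC":
                    # The copy list starts here, so the body ends one line earlier.
                    cc_line = line_number - 1
                words.append((word, line_number))
    last_line = line_number if cc_line is None else min(cc_line, line_number)
    # Backing up 4 lines gives the fifth line from the end of the body.
    return words, last_line - 4


pages = [["Dear Mr Smith", "Order 1234 shipped"], ["Regards", "cc J Doe"]]
words, last_body_line = scan_document(pages)
```

In the real routine each word is handed to the Process Word subroutine as it is found; collecting the words in a list here simply keeps the sketch self-contained.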
TABLE 2

Delete Abstract Subroutine

BEGINPROCEDURE(DELETE_ABSTRACT);
  ENTER DELETE ABSTRACT;
  WHILE
    NOT END OF DOCUMENT ABSTRACT RECORD
  DO
    GET THE NEXT ENTRY IN THE DOCUMENT ABSTRACT RECORD;
    READ THE WORD INDEX RECORD FOR THE WORD;
    WHILE
      NOT END OF WORD INDEX RECORD
    DO
      GET THE NEXT ENTRY IN THE WORD INDEX RECORD;
      IF
        THE DOCUMENT NUMBER IN THE ENTRY IS THE
        SAME AS THE DOCUMENT NUMBER FROM
        THE DOCUMENT ABSTRACT RECORD
      THEN
        REMOVE THE ENTRY FROM THE WORD INDEX RECORD;
        IF
          THERE ARE NOW NO ENTRIES IN THE WORD INDEX RECORD
        THEN
          DELETE THE WORD INDEX RECORD FROM THE FILE;
        ELSE
          REWRITE THE WORD INDEX RECORD TO THE FILE;
        ENDIF;
      ENDIF;
    ENDWHILE;
  ENDWHILE;
  DELETE THE DOCUMENT ABSTRACT RECORD FROM THE FILE;
ENDPROCEDURE(DELETE_ABSTRACT);



The delete abstract subroutine of Table 2 deletes the abstract from
memory by deleting occurrences of the words in the abstract from the
word index file. The makeup of the word index file will be fully
explained below.
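
The deletion logic can be sketched in Python as follows. The two dictionaries standing in for the document abstract file and the word index file, and the `"doc"` key on each entry, are assumptions made for illustration; the patent does not specify a storage layout at this level.

```python
def delete_abstract(doc_number, abstract_file, word_index):
    """Remove a document's abstract and its word index entries.

    abstract_file: dict mapping document number -> list of abstract words.
    word_index:    dict mapping word -> list of entry dicts with a "doc" key.
    Both layouts are hypothetical.
    """
    for word in abstract_file.get(doc_number, []):
        entries = word_index.get(word, [])
        remaining = [e for e in entries if e["doc"] != doc_number]
        if remaining:
            word_index[word] = remaining   # rewrite the word index record
        else:
            word_index.pop(word, None)     # no entries left: delete the record
    abstract_file.pop(doc_number, None)    # finally delete the abstract record


abstract_file = {"D1": ["IBM", "INVOICE"], "D2": ["INVOICE"]}
word_index = {
    "IBM": [{"doc": "D1"}],
    "INVOICE": [{"doc": "D1"}, {"doc": "D2"}],
}
delete_abstract("D1", abstract_file, word_index)
```

After the call, the record for a word that occurred only in the deleted document is gone, while a word shared with another document keeps its remaining entry.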

Following deletion of the existing abstract from memory, or, if no
words having the document number are stored in the word index file,
the document is processed at block 23 to create an abstract. Referring
to the program routine in Table 1, the next word in the document is
tested to determine if the Carbon Copy (CC) list follows. If not, the
program branches to the abstract process word routine in Table 3 to
determine if the word should be included in the abstract for the
document.

TABLE 3

Abstract Process Word Subroutine

BEGINPROCEDURE(ABSTRACT_PROCESS_WORD);
  ENTER PROCESS WORD;
  INCREMENT DOCUMENT WORD COUNT BY 1;
  LOOK THE WORD UP IN THE DICTIONARY;
  IF
    THE WORD WAS FOUND IN THE DICTIONARY BUT
    NOT FLAGGED AS A NOUN OR A SINGLE
    PURPOSE ADJECTIVE
  THEN
    IGNORE THIS WORD;
  ELSE
    IF
      THE WORD WAS FOUND IN THE DICTIONARY BUT
      FLAGGED AS A NOUN OR A SINGLE PURPOSE
      ADJECTIVE
    THEN
      FLAG THE WORD AS NORMAL;
    ELSE
      FLAG THE WORD AS ACRONYM;
    ENDIF;
    IF
      THIS WORD HAS NOT BEEN FOUND PREVIOUSLY IN
      THIS DOCUMENT
    THEN
      SAVE THIS WORD;
      SAVE THE DOCUMENT LINE COUNT;
      SET FREQUENCY COUNT FOR THIS WORD TO 1;
    ELSE
      INCREMENT FREQUENCY COUNT FOR THIS WORD BY 1;
    ENDIF;
  ENDIF;
ENDPROCEDURE(ABSTRACT_PROCESS_WORD);

As was previously stated, the criterion for determining whether a word
is included in the abstract is whether the word is determined to be a
"message specialization term", i.e., a noun, single purpose adjective,
proper name, acronym, or numeric. The program routine of Table 3
compares the word to the contents of dictionary memory 8 (FIG. 1). If
the word is found in the dictionary memory but it is not a noun or
single purpose adjective then the word is ignored. The decision as to
whether a word in the dictionary is a noun or single purpose adjective
is made at the time of preparation of the dictionary memory 8 and
those words designated as nouns or single purpose adjectives have
appended to them a code bit. If the word is determined to be a noun or
single purpose adjective, a code bit or "flag" is added to the word to
indicate its status as "normal". If the word is not in the dictionary
then a code bit or "flag" is added to the word to indicate its status
as acronym or proper name. Acronyms and proper names are considered to
have more influence as message specialization terms than nouns and
single purpose adjectives and therefore are more useful for document
retrieval as will be shown below. The Process Word routine of Table 3
controls the processor 10 to save only one copy of each abstract term
for storage in the word index file. However, the Process Word routine
appends to the word the number of each line in the document where the
word appears and a count of the number of times the word appears in
the document. As will be seen below for document retrieval, the
frequency of occurrence of the word in the document and the place of
occurrence help determine the value of the word as a query term for
retrieving the document.

Following completion of the Process Word subroutine, control returns
to the Abstract routine in Table 1 which repeats the routines for each
word in the document. The Abstract routine accumulates a count for the
number of pages in the document. Upon reaching the end of the document
a count is calculated to determine the fifth line from the end of the
body of the document and the Abstract End Processing subroutine of
Table 4 is selected.
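
The word selection just described can be sketched in Python as follows. The dictionary layout (a mapping from word to a boolean noun/single purpose adjective flag) and the shape of the `abstract` entries are assumptions for illustration only. As in Table 3, any word absent from the dictionary, including a numeric, receives the acronym/proper name flag.

```python
def process_word(word, line_number, dictionary, abstract):
    """Classify one word and update the in-progress abstract.

    dictionary: dict mapping word -> True if flagged as a noun or single
                purpose adjective, False otherwise (hypothetical layout).
    abstract:   dict mapping word -> {"flag", "lines", "freq"}.
    """
    if word in dictionary and not dictionary[word]:
        return  # ordinary dictionary word: not a message specialization term
    flag = "NORMAL" if word in dictionary else "ACRONYM"
    entry = abstract.get(word)
    if entry is None:
        # First occurrence: save one copy of the word with its line number.
        abstract[word] = {"flag": flag, "lines": [line_number], "freq": 1}
    else:
        entry["lines"].append(line_number)
        entry["freq"] += 1


dictionary = {"THE": False, "INVOICE": True}
abstract = {}
for w, n in [("THE", 1), ("INVOICE", 1), ("IBM", 2), ("INVOICE", 3)]:
    process_word(w, n, dictionary, abstract)
```

Only one copy of each term is kept, with its line numbers and frequency count attached, which is the information the retrieval weighting later relies on.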

TABLE 4

Abstract End Processing Subroutine

BEGINPROCEDURE(ABSTRACT_END_PROCESSING);
  ENTER END PROCESSING;
  CREATE A DOCUMENT ABSTRACT RECORD CONSISTING OF:
    THE DOCUMENT NUMBER, THE DOCUMENT WORD COUNT, AND
    EACH WORD IN THE ABSTRACT;
  WRITE THE DOCUMENT ABSTRACT RECORD TO THE FILE;
  WHILE
    MORE WORDS ARE LEFT TO PROCESS
  DO
    READ THE WORD INDEX RECORD FOR THE WORD;
    IF
      THE RECORD WAS NOT FOUND
    THEN
      CREATE A WORD INDEX RECORD CONSISTING OF:
        THE WORD, THE NORMAL/ACRONYM/PROPER NAME
        FLAG, THE DOCUMENT NUMBER, THE FREQUENCY
        COUNT, AND A FLAG INDICATING IN HEADER/
        TRAILER/CC LIST/BODY;
      WRITE THE WORD INDEX RECORD TO THE FILE;
    ELSE
      ADD THE DOCUMENT NUMBER, THE FREQUENCY COUNT,
        AND A FLAG INDICATING IN HEADER/TRAILER/CC
        LIST/BODY TO THE RECORD;
      REWRITE THE WORD INDEX RECORD TO THE FILE;
    ENDIF;
  ENDWHILE;
ENDPROCEDURE(ABSTRACT_END_PROCESSING);

The Abstract End Processing subroutine controls the processor 10 to
create an abstract record which includes all words saved by the
Process Word subroutine of Table 3, a count of the number of words in
the document and the document identifier code number. The Abstract End
Processing subroutine also creates a Word Index Record for each word
in the abstract record which includes the word, the "normal" or
"acronym/proper name" code, the document number, the number of pages
in the document, the frequency of occurrence of the word in the
document, and a code indicating whether the word occurs in the header
(first 10 lines), trailer (last 5 lines) or the copy list or body of
the document. The words in the Word Index File are searched to
determine if a record for the word already appears in the Word Index
File. If it does then the record is updated by adding the document
number, frequency count, and codes such that no duplicates of the word
appear in the Word Index File. Following completion of the Abstract
End Processing subroutine of Table 4 control returns to the Abstract
routine of Table 1 which terminates the abstracting procedure.

To retrieve a document stored in the system, the requestor must enter
a query for the document into the system. This may be done through a
keyboard, for example. The queries used with the preferred embodiment
of this system can be a natural language statement or string of words
that describes the item. The search argument is created by testing the
query words against the word index file. In many cases, the words in
the search argument will occur in the key word records (abstracts) of
several documents. In order to provide better discrimination between
contending documents, different weights are applied to different key
words. Weighting criteria are applied according to these general
rules:

1 — Matches on numeric key words are given greater weight than
matches on alpha key words.

2 — Matches with key words that are proper names or acronyms are
given greater weight than matches with nouns or single purpose
adjectives that are found in the dictionary memory.

3 — The weight assigned to a key word match is proportional to the
number of times that the word occurred in the document divided by the
log of the number of pages in the document.

4 — Matches with key words that occur in the first ten lines of the
document are given greater weight than those of key words in the
center of the body of text.

5 — Matches that occur with key words in the last five lines of text
(before any copy lists) are given more weight than matches with words
in the center of the text, but less weight than matches with words in
the first ten lines.

6 — The weight of a key word match is increased when that word is the
name of a month or year.

7 — The weight of a key word match is inversely proportional to the
number of documents in the entire file that contain that key word in
the body of the document (excluding occurrences as part of the copy
list).

The rationale behind these general rules is to give the greatest
weight to those matches that involve key words that have the most
narrowly specific meaning. It is assumed that specific names, numbers
and dates have very specific meaning so they are weighed heavily. It
is also assumed that the most specific items will be mentioned at the
beginning or end of the correspondence. Hence, words occurring in
these regions are also given greater weight. An example of an
expression that satisfies the general rules is the following:



Match Value = (expression not legible in the source)



where:

F_ij = number of times the ith key word appears in the jth document
divided by log2 of the number of pages in the document.

A_i = binary indicator if the ith key word is an acronym or proper
name.

K_i = binary indicator if the ith key word occurs in the first 10
lines.

L_i = binary indicator if the ith key word is a numeric.

E_i = binary indicator if the ith key word occurs in the last 5 lines.

H_i = binary indicator if the ith key word occurs in the dictionary as
a noun or single purpose adjective.

M_i = binary indicator if the ith key word is a month.

Y_i = binary indicator if the ith key word is a year.

D_i = number of documents that contain the ith key word.
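
One expression consistent with these definitions, and with the weights used in Tables 6 and 7, can be written as below. It is a reconstruction under stated assumptions, not the printed formula: the constants 10, 10, 3, 10 and 5 and the two 20 percent increases are taken from Tables 6 and 7, while the way the terms are combined, the reading of the month and year increases as successive multiplications, and the omission of the copy list divisor are assumptions. The sum runs over the query key words matched in document j, and D_i is assumed to be greater than 1 so that the logarithm is non-zero.

```latex
\text{Match Value}_j
  = \Bigl(1 + 0.2\,\max_i M_i\Bigr)\Bigl(1 + 0.2\,\max_i Y_i\Bigr)
    \sum_i \frac{10A_i + 10L_i + 3H_i + 10K_i + 5E_i + F_{ij}}{\log_2 D_i}
```

Each term in the numerator corresponds to one of the general rules: the indicators A_i, L_i and H_i reward the type of word, K_i and E_i reward its position, F_ij rewards its frequency, and the divisor penalizes words that occur in many documents.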

Referring to FIG. 3, a flow chart of the processing of 
a query for a document is shown. At block 30 the user 
query is input to the processor 10 (FIG. 1) from input 
register 16 over bus 15. Tables 5, 6, and 7 show program 
routines for processing the user query according to the 
general rules stated above. 

TABLE 5

Query Routine

BEGINPROCEDURE(OCRS_QUERY);
  ENTER QUERY;
  WHILE
    MORE QUERY LINES OF TEXT EXIST
  DO
    GET THE NEXT LINE OF QUERY TEXT;
    WHILE
      MORE CHARACTERS EXIST ON THE LINE
    DO
      GET THE NEXT WORD FROM THE LINE (2 OR MORE
        CHARACTERS A-Z, 0-9, OR y,
      READ THE WORD INDEX RECORD FOR THE QUERY WORD;
      IF
        WORD FOUND
      THEN
        CALL (QUERY_PROCESS_WORD);
      ENDIF;
    ENDWHILE;
  ENDWHILE;
  CALL (QUERY_END_PROCESSING);
ENDPROCEDURE(OCRS_QUERY);

The Query routine of Table 5 compares the query words to the contents
of the word index file as shown in block 31 of the flow diagram of
FIG. 3. The query words that match the word index file are processed
at block 32 of the flow diagram by the Query Process Word subroutine
of Table 6.
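
The comparison of query words against the word index file can be sketched in Python as follows. The function name `matched_query_words`, the dictionary standing in for the word index file, and the word rule (two or more letters or digits, folded to upper case) are assumptions for illustration.

```python
import re

# Assumption: the same word rule as for abstraction, two or more
# consecutive letters or digits.
WORD_RE = re.compile(r"[A-Za-z0-9]{2,}")


def matched_query_words(query, word_index):
    """Return the query words that have a record in the word index file."""
    matches = []
    for line in query.splitlines():
        for word in WORD_RE.findall(line):
            word = word.upper()
            if word in word_index:
                matches.append(word)  # would be passed on for weighting
    return matches


word_index = {"INVOICE": [], "IBM": []}
found = matched_query_words("the letter about the IBM invoice\nfrom March", word_index)
```

Because unmatched words are simply skipped, the requestor can phrase the query freely; only the words that happen to be key words contribute to the search argument.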



TABLE 6

Query Process Word Subroutine Detailed Logic

BEGINPROCEDURE(QUERY_PROCESS_WORD);
  ENTER PROCESS WORD;
  IF
    THE WORD IS A YEAR
  THEN
    SET INDICATOR FOR YEAR IN QUERY;
  ENDIF;
  IF
    THE WORD IS A MONTH
  THEN
    SET INDICATOR FOR MONTH IN QUERY;
  ENDIF;
  IF
    THE WORD IS NUMERIC
  THEN
    SET NUMBER WEIGHT TO 10;
  ELSE
    SET NUMBER WEIGHT TO 0;
  ENDIF;
  COUNT THE NUMBER OF DOCUMENTS CONTAINING THIS WORD;
  COUNT THE NUMBER OF DOCUMENTS WHERE
    THE WORD IS NOT IN THE CC LIST;
  IF
    THE WORD INDEX RECORD IS FLAGGED AS AN
    ACRONYM/PROPER NAME
  THEN
    SET ACRONYM/PROPER NAME WEIGHT TO 10;
  ELSE
    SET NORMAL WEIGHT TO 3;
  ENDIF;
  WHILE
    MORE DOCUMENT ENTRIES ARE IN THE WORD INDEX RECORD
  DO
    GET THE NEXT DOCUMENT ENTRY FROM THE WORD INDEX RECORD;
    IF
      THE FLAG INDICATES THAT THE WORD OCCURRED
      IN THE HEADER
    THEN
      SET HEADER WEIGHT TO 10;
    ELSE
      SET HEADER WEIGHT TO 0;
    ENDIF;
    IF
      THE FLAG INDICATES THAT THE WORD OCCURRED
      IN THE TRAILER
    THEN
      SET TRAILER WEIGHT TO 5;
    ELSE
      SET TRAILER WEIGHT TO 0;
    ENDIF;
    IF
      THE FLAG INDICATES THAT THE WORD OCCURRED
      IN THE CC LIST
    THEN
      SET CC DIVIDE WEIGHT TO 99.999;
    ELSE
      SET CC DIVIDE WEIGHT TO 1;
    ENDIF;
    SET THE RETRIEVAL VALUE TO:
      (ACRONYM/PROPER NAME WEIGHT + NUMBER
      WEIGHT + NORMAL WEIGHT + HEADER WEIGHT +
      TRAILER WEIGHT + WORD FREQUENCY DIVIDED
      BY THE LOG BASE 2 OF COUNT OF NUMBER OF
      PAGES) DIVIDED BY THE LOG BASE 2 OF THE
      COUNT OF DOCUMENTS NOT CONTAINING THE
      WORD IN THE CC LIST;
    DIVIDE THE RETRIEVAL VALUE BY THE CC DIVIDE WEIGHT;
    IF
      THIS DOCUMENT HAS NOT BEEN ANALYZED YET
      IN THIS QUERY
    THEN
      SAVE THE DOCUMENT NUMBER;
      SAVE THE RETRIEVAL VALUE;
    ELSE
      INCREMENT THE DOCUMENT'S RETRIEVAL VALUE
        BY THE NEW RETRIEVAL VALUE;
    ENDIF;
  ENDWHILE;
ENDPROCEDURE(QUERY_PROCESS_WORD);



Each query word is tested to determine if it is a month, year,
numeric, acronym or normal (noun or single purpose adjective). The
subroutine of Table 6 also adds weighting factors if the indicators in
the word index file show the word occurs in the first ten lines
(Header) of the document, last five lines (Trailer) of the document,
or occurs more than once in the document. The value of the word is
reduced if it occurs in the copy list of the document or occurs in
more than one document. An overall value for each word is calculated
and a total value for all query words that match words in the word
index file is accumulated for each document number having any matches.
The steps of calculating the retrieval value for words and the
retrieval value for documents are shown in blocks 33 and 34 of FIG. 3.
Following processing of all words in the query, the Query routine of
Table 5 branches to the Month/Year Evaluation subroutine of Table 7.
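
The per-word retrieval value can be sketched in Python as follows, using the weights 10, 10, 3, 10, 5 and 99.999 from Table 6. The record layout and the function name `score_word` are hypothetical. Table 6 does not say what happens when a document has one page or when only one document contains the word, in which case the base 2 logarithm is zero; clamping the divisor to 1 in those cases is an assumption of this sketch.

```python
import math
from collections import defaultdict


def score_word(record, scores):
    """Add one matched query word's retrieval values to `scores`.

    record: {"flag": "NORMAL" | "ACRONYM", "numeric": bool,
             "entries": [{"doc", "freq", "pages", "place"}]}
    where place is "HEADER", "TRAILER", "CC" or "BODY" (hypothetical layout).
    """
    type_weight = 10 if record["flag"] == "ACRONYM" else 3
    number_weight = 10 if record["numeric"] else 0
    docs_not_cc = sum(1 for e in record["entries"] if e["place"] != "CC")
    # Assumption: clamp the divisors to 1 so a zero logarithm cannot occur.
    doc_divisor = max(1.0, math.log2(max(docs_not_cc, 1)))
    for e in record["entries"]:
        header = 10 if e["place"] == "HEADER" else 0
        trailer = 5 if e["place"] == "TRAILER" else 0
        page_divisor = max(1.0, math.log2(e["pages"]))
        value = (type_weight + number_weight + header + trailer
                 + e["freq"] / page_divisor) / doc_divisor
        if e["place"] == "CC":
            value /= 99.999  # copy list occurrences count for very little
        scores[e["doc"]] += value


scores = defaultdict(float)
record = {
    "flag": "ACRONYM",
    "numeric": False,
    "entries": [
        {"doc": "D1", "freq": 2, "pages": 1, "place": "HEADER"},
        {"doc": "D2", "freq": 1, "pages": 4, "place": "BODY"},
    ],
}
score_word(record, scores)
```

Calling `score_word` once per matched query word accumulates a total for each document, which corresponds to blocks 33 and 34 of FIG. 3.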

TABLE 7

Query Month/Year Evaluation

BEGINPROCEDURE(QUERY_END_PROCESSING);
  ENTER END PROCESSING;
  IF
    THERE WAS A YEAR MENTIONED IN THE QUERY
  THEN
    INCREMENT THE RETRIEVAL VALUE OF EACH
      DOCUMENT THAT DID CONTAIN THE YEAR BY 20%;
  ENDIF;
  IF
    THERE WAS A MONTH MENTIONED IN THE QUERY
  THEN
    INCREMENT THE RETRIEVAL VALUE OF EACH
      DOCUMENT THAT DID CONTAIN THE MONTH BY 20%;
  ENDIF;
  RETRIEVE THE DOCUMENT NUMBERS OF THE
    DOCUMENTS WHOSE RETRIEVAL VALUE IS WITHIN
    25% OF THE HIGHEST RETRIEVAL VALUE;
  SORT THIS LIST BY THE NUMBER OF WORDS FROM
    THE QUERY ACTUALLY OCCURRING
    IN THE DOCUMENT;
  OUTPUT THE DOCUMENTS;
ENDPROCEDURE(QUERY_END_PROCESSING);

The subroutine of Table 7 increases the retrieval value for each
document that contains a year and/or month that matches a year and/or
month in the query. The subroutine of Table 7 then controls the
processor 10 to output those documents from main memory 12 to output
register 18 whose retrieval value is within 25 percent of the highest
retrieval value calculated. Control is then returned to the Query
routine of Table 5 which terminates the query procedure.
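
The final selection step can be sketched in Python as follows. The function name `select_documents` and its four input structures are hypothetical. Reading "within 25 percent of the highest retrieval value" as "at least 75 percent of the highest value", and applying the two 20 percent increases one after the other, are assumptions of this sketch.

```python
def select_documents(scores, match_counts, has_year, has_month):
    """Apply the month/year increases, then the 25 percent cutoff.

    scores:       dict document number -> accumulated retrieval value.
    match_counts: dict document number -> number of query words found in it.
    has_year / has_month: sets of documents that matched a year or a month
    named in the query. All four structures are assumptions.
    """
    final = {}
    for doc, value in scores.items():
        if doc in has_year:
            value *= 1.2
        if doc in has_month:
            value *= 1.2
        final[doc] = value
    if not final:
        return []
    best = max(final.values())
    keep = [d for d, v in final.items() if v >= 0.75 * best]
    # Documents matching the most query words are listed first.
    return sorted(keep, key=lambda d: match_counts.get(d, 0), reverse=True)


ranked = select_documents(
    scores={"D1": 20.0, "D2": 16.0, "D3": 5.0},
    match_counts={"D1": 1, "D2": 3, "D3": 2},
    has_year={"D2"},
    has_month=set(),
)
```

Here the year match lifts D2 into the cutoff band next to D1, and the sort on matched query words then places D2 ahead of D1, while D3 falls below the cutoff and is not output.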

While the invention has been shown and described with reference to a
specific set of computer instructions, i.e. PDL, and retrieval
weighting values, it will be understood by those skilled in the art
that the spirit of this invention can be implemented in other computer
languages and the set of document retrieval weighting factors can be
modified without avoiding the scope of the invention claimed herein.

What is claimed is: 

1. A method for abstracting and archiving a document
in machine readable form comprising the steps of:
(a) storing a dictionary of language terms commonly 
used in document preparation; 



(b) appending codes to the language terms in said 
dictionary of language terms to identify selected 
parts of speech; 

(c) comparing the language terms in an input docu- 
ment with the stored dictionary of language terms; 

(d) selecting language terms from said input docu- 
ment which do not compare to the stored dictio- 
nary of language terms; 

(e) selecting language terms from said input docu- 
ment which compare with language terms in said 
stored dictionary of language terms identified as 
selected parts of speech; 

(f) coding the selected language terms with the
identity of the input document; and

(g) storing the selected language terms for later re- 
call. 

2. The method of claim 1 further including the steps 
of accumulating a count for the number of times each of 
the selected language terms occurs in the input docu- 
ment and accumulating a count of the number of pages 
in the input document. 

3. The method of claim 1 or claim 2 further including 
the step of appending to each selected language term a 
code indicating the position of occurrence of the se- 
lected language term in the input document. 

4. A method for retrieving a document from storage 
in response to input language terms descriptive of the 
content of the document comprising the steps of: 

(a) comparing each of the input language terms to 
stored document abstract files of language terms, 
each document abstract language term having asso- 
ciated with it a code identifying its part of speech, 
a count indicating its frequency of occurrence in 
the document, a count of the number of pages in 
the document, and an indicator of the position of 
occurrence of the term in the document; 

(b) accumulating a retrieval record for each document
abstract file composed of the language terms
that compare equal;

(c) calculating a document retrieval value for each 
retrieval record using the part of speech code, 
frequency count, number of pages in the document, 
and position indicator for each language term in 
the retrieval record; 

(d) increasing the document retrieval value for each 
retrieval record that includes a month and/or year; 
and 

(e) selecting the document corresponding to the highest
calculated retrieval value for output.

5. The method of claim 4 further including the step of 
selecting all documents whose calculated retrieval 
value is equal to or greater than a predetermined per- 
centage of the highest calculated retrieval value. 

6. A system for abstracting a document in machine 
readable form comprising: 

means for storing a dictionary of language terms 
commonly used in document preparation, said lan- 
guage terms including a code identifying certain 
ones of said language terms as selected parts of 
speech; 

means for receiving an input document of language 
terms in machine readable form, said input docu- 
ment including an identification code; 

a memory; 

control means connected to said means for storing, 
said means for receiving and said memory, includ- 
ing, 
means for comparing the language terms of said input
document to said dictionary of language terms,

first selecting means responsive to said means for
comparing for selecting the language terms from
said input document that compare unequal,

second selecting means responsive to said means for
comparing for selecting the language terms from
said input document that compare equal and are
coded as selected parts of speech;

first counting means responsive to said first and second
selecting means for counting the frequency of
occurrence of each selected language term in the
input document;

second counting means responsive to said means for
receiving for counting the number of pages in the
document;

means responsive to said first and second selecting
means for calculating the position of occurrence of
the selected language terms in the input document;
and

means responsive to said first and second selecting
means, said first and second counting means, and
said means for calculating for storing in said memory
a record of each selected language term including
the document identification code, the language
term, the selected part of speech code, the frequency
of occurrence count, the count of pages in
the document, and the position of occurrence code.

7. The system of claim 6 wherein said control means
further includes means for comparing each selected
language term from the input document to selected
language terms currently stored in said memory, and
means responsive to an equal compare for adding to the
record of the selected language term stored in said
memory the identification code of the input document,
the frequency of occurrence count, and position of
occurrence code for the selected language term,
thereby eliminating the need for duplicate storage of the
selected language term.

8. A system for retrieving a document from storage in 
response to an input query of language terms descrip- 
tive of the content of the document comprising: 
a memory having stored therein language term records
including the language term, identification
codes of documents containing the language term, 
a selected parts of speech code, a frequency of 
occurrence count for the language term, a count of 
pages in each document, and a position of occur- 
rence code for each document identification code 
in each language term record; 
means for comparing the language terms of the input 
query to language term records stored in said mem- 
ory; 

means for accumulating a retrieval record for each 
document identification code of each language 
term that compares equal; 
means responsive to said means for accumulating for
calculating a document retrieval value for each
retrieval record using the selected part of speech
code, frequency of occurrence count, count of
pages and position of occurrence code; and
means responsive to said means for calculating for 
outputting from memory the document whose 
identification code corresponds to the identifica- 
tion code for the highest calculated retrieval value. 

9. The system of claim 8 wherein said means for cal- 
culating further includes means for increasing the docu- 
ment retrieval value for each retrieval record that in- 
cludes a month that compares equal to a term in the 
input query and further increasing the document re- 
trieval value for each record that includes a year that 
compares equal to a term in the input query. 

10. The system of claim 8 or claim 9 wherein said 
means for calculating includes means calculating a per- 
centage of the highest calculated retrieval value for 
each document identification code and said means for 
outputting further includes means for outputting all 
documents whose retrieval value exceeds a predeter- 
mined percentage of the highest calculated retrieval 
value. 

11. The system of claim 10 wherein means for outputting
further includes means for selecting documents for
display in the descending order of the number of query
terms that matched language term records for the
document.


